The Serialization of Heterogeneous Documents
Abstract
Tasks involving the analysis of natural language are typically conducted on a corpus or corpora of plain text. However, it is rare that a document is unstructured and free-form in its entirety. Documents such as corporate disclosures, medical journals, and other knowledge-rich archives contain structured and loosely structured information that can be used in a variety of important text mining tasks. In this paper we propose a syntactical preprocessing architecture to serialize presentation-oriented documents to a machine-readable format that aims to preserve the document's structure, contents, and metadata. We introduce a hybrid pipeline architecture and discuss its constituent processes and the future research directions that could lead to a holistic representation of heterogeneous documents.
Similar resources
Energy-optimized Data Serialization For Heterogeneous WSNs Using Middleware Synthesis
Developing applications for resource-constrained devices is an intricate task in itself and additionally requires in-depth domain expertise to optimize aspects such as communication overhead, resource usage and energy consumption. Frequently, these refinements are omitted because they are time-consuming, laborious and error-prone. Hence, automating these aspects lets developers and applications ...
XML Binary Serialization using Cross-Format Schema Protocol (XFSP) and XML Compression Considerations for Extensible 3D (X3D) Graphics
The NPS Cross-Format Schema Protocol (XFSP) has been developed as a general approach to binary serialization of XML documents. Elements and attributes are replaced via a tokenization scheme which carefully preserves valid XML document structure. XFSP uses XML schema as the basis for determining key document parameters such as legal elements, attributes and data types. Originally motivated by th...
A symbol spotting approach in graphical documents by hashing serialized graphs
In this paper we propose a symbol spotting technique in graphical documents. Graphs are used to represent the documents and a (sub)graph matching technique is used to detect the symbols in them. We propose a graph serialization to reduce the usual computational complexity of graph matching. Serialization of graphs is performed by computing acyclic graph paths between each pair of connected node...
Repository for Business Processes and Arbitrary Associated Metadata
We have published a repository for storing business processes and associated metadata. The BPEL Repository is an Eclipse plug-in originally built for BPEL business processes and other related XML data. It provides a framework for storing, finding and using these documents. Other research prototypes can reuse these features and build on top of it. The repository can easily be extended with new t...
Less Destructive Cleaning of Web Documents by Using Standoff Annotation
Standoff annotation, that is, the separation of primary data and markup, can be an interesting option for annotating web pages, since it does not require removing annotations already present in them. We present a standoff serialization that allows well-formed web pages to be annotated with multiple annotation layers in a single instance, easing the processing and analysis of the data.
Publication date: 2015